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1.  INTRODUCTION 


This  report  describes  CMCAM,  a  program  for  performing  lattice-gas  calculations  using  the  Thinking 
Machines  Corporation  CM-S  supercomputer.  CMCAM  represents  part  of  a  continuing  investigation  into 
the  applicability  and  performance  of  contemporary  parallel  supercomputers  for  lattice  gas  simulation. 
The  development  of  this  code  was  guided  by  two  goals.  The  first  goal  was  to  achieve  high  performance 
in  order  to  address  significantly  larger  problems  than  can  be  handled  with  ordinary  high  performance 
workstations.  The  second  goal  was  to  produce  a  code  that  is  easy  to  modify  for  new  experimental 
scenarios.  An  earlier  implementation  done  using  only  high  level  language  facilities  was  deemed  too  slow 
to  be  really  useful.  The  present  implementation  uses  assembler  language  for  the  highest  possible 
performance.  The  use  of  an  assembler  language  certainly  increases  the  complexity  of  the  code,  but 
provides  an  enormous  boost  in  the  delivered  performance.  The  code  has  been  designed  with  the  hope 
that  future  extensions  and  enhancements  to  the  code  will  involve  a  minimum  amount  of  assembly  code 
manipulation.  The  code  has  been  built  with  an  eye  towards  compatibility  with  another  parallel 
supercomputer  architecture,  the  MIT  Information  Mechanics  group’s  CAM-8  machine  [Margolus,  1993]. 
Many  components  of  a  CMCAM  simulation  can  be  re-used  without  change  on  the  CAM-8.  After  reading 
this  report  an  experienced  C  programmer  should  be  able  to  understand  the  techniques  used  to  perform 
lattice  gas  simulation  on  the  CM-S.  The  report  should  also  aid  persons  interested  in  modifying  the  code 
for  future  lattice  gas  scenarios. 


2.  LATTICE  GAS  MODELING 


Lattice-Gas-Automata  (LGA)  models  represent  an  intriguing  alternative  to  conventional  methods  of 
hydrodynamic  simulation.  These  exactly  computable  models  are  based  on  particles  moving  on  a  uniform 
lattice  with  discrete  velocities.  Particles  are  typically  represented  using  individual  bits  to  indicate  their 
presence  or  absence  at  a  particular  site.  The  particles  interact  with  each  other  through  collisions  that 
conserve  the  desired  invariant  quantities,  typically  mass,  momentum  and  energy.  The  collisions  contain 
the  physics  of  the  particular  system  under  study. 

An  LGA  time  step  can  be  separated  into  two  phases.  The  first  phase  involves  streaming  of  the  particles 
to  their  new  locations,  consistent  with  their  velocity  and  the  lattice  on  which  the  simulation  is  being 
performed.  Once  all  the  particles  are  at  the  appropriate  lattice  sites  they  interact,  according  to  the 
specified  "rules"  of  the  simulation.  After  the  collision  process  the  streaming  step  is  repeated. 

LGA  are  numerically  stable  methods  that  are  able  to  easily  accommodate  highly  irregular  boundary 
conditions.  The  FHP  model  developed  by  Frisch  Hasslacher  and  Pomeau  rigorously  shows  2-D  Navier 
Stokes  flow  in  the  incompressible  limit  [Frisch,  et  al.,  1986].  LGA  modeling  methodology  has  been 
extended  to  3  dimensional  hydrodynamic  flows  with  the  advent  of  the  FCHC  model  [ d’Humieres ,  et  al., 
1986].  Due  to  the  uniformity  and  concurrency  of  the  lattice  gas  update  process,  implementation  on 
parallel  computers  is  usually  efficient,  requiring  only  local  communication.  LGA  modeling  offers  many 
substantial  advantages  over  conventional  finite-difference  techniques  and  it  is  attracting  increasing 
attention  as  a  promising  new  approach  to  fluid  flow  [Doolen,  1990], 


3.  CM-5  ARCHITECTURE 


The  Thinking  Machines  Corporations’s  CM-5  is  a  massively  parallel  computer  that  can  contain  up  to 
16384  processing  nodes  [Thinking  Machines  Corp. ,  1992].  Figure  1  shows  an  individual  processing  node 
consisting  of  a  SPARC  CPU,  32  Mbytes  of  memory  and  4  Vector  processing  units. 


8  Mb 
RAM 

8  Mb 
RAM 

8  Mb 
RAM 

8  Mb 
RAM 

r  i i 

128  Mb/sac  128  Mb/sac  128  Mb/sec  128  Mb/sec 

_ 1 _ 1 _ 1 _ L 

Vector 

Unit 

Vector 

Unit 

Vector 

Unit 

Vector 

Unit 

SPARC 

CPU 


Communication 

Network 

Interface 


64-bit  bus 


Figure  1.  CM-5  Node  Architecture 


These  processing  nodes  are  all  connected  via  a  "fat-tree"  communications  network  that  allows  fast 
inter-node  communication.  These  processing  nodes  are  controlled  by  a  front-end  host  computer  which 
is  a  modified  SUN  workstation.  The  SPARC  processor  on  each  node  issues  instructions  to  the  vector 
units  and  performs  most  address  bookkeeping  tasks  while  the  vector  units  perform  arithmetic  and  logical 
operations  on  the  data.  Each  vector  unit  has  a  peak  rate  of  32  million  64-bit  ops  (floating  point  or 
integer)  for  a  combined  total  of  128  Mops/node.  Each  node’s  memory  is  divided  into  8  Mbyte  banks, 
one  for  each  vector  unit.  The  banks  of  memory  are  mapped  into  distinct  parts  of  the  address  space, 
inter-bank  communication  is  be  mediated  by  the  SPARC  processor  in  most  cases.  Each  vector  unit  has 
it’s  own  independent  128  Mbyte/sec  path  to  memory  for  a  combined  memory  bandwidth  of  512 
Mbyte/sec  for  each  node.  The  vector  units  also  act  as  high  performance  memory  interfaces  when  their 
arithmetic  and  logical  capabilities  are  not  being  used.  The  CM-5  at  the  Army  High  Performance 
Computing  Research  Center  in  Minneapolis,  Minnesota  currently  contains  512  nodes  for  a  total  of  16  Gb 
of  memory  and  64  Gops  of  peak  processing  speed.  A  CM-5  at  Los  Alamos  National  Labs  contains  1024 
nodes,  for  twice  the  capacity. 

CMCAM  is  implemented  on  the  CM-5  in  a  Multiple  Instruction  Multiple  Data  (MIMD)  style.  The 
CMMD  message  passing  library  is  used  for  inter-node  communication  and  host-node  interaction  [Thinking 
Machines  Corp.,  1993a].  In  order  to  get  the  highest  possible  performance  the  vector  units  on  each  node 
are  explicitly  manipulated  using  their  assembler  language  known  as  DPEAC.  To  ease  the  burden  of  hand 
coding  the  vector  units  a  macro  package  known  as  GCC/DPEAC  is  used  [Thinking  Machines  Corp., 
1993b].  This  package  uses  features  available  in  the  GNU  C  compiler  to  issue  assembler  language 
instructions  from  ANSI  C  and  simplifies  matters  considerably. 
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We  partition  the  problem  space  into  equally  sized  rectangular  units.  Figure  2  shows  this  partitioning  for 
N  nodes.  Each  processing  node  is  responsible  for  updating  one  of  these  rectangular  units.  This 
partitioning  allows  one  to  send  a  small  number  of  long  messages  to  connect  the  space  together. 
Inter-node  communication  is  only  necessary  along  one  of  the  axes  of  the  problem  space.  Since  the 
inter-node  communications  network  is  optimized  for  long  message  lengths  we  expect  that  this  partitioning 
will  make  effective  use  of  available  communications  bandwidth.  This  approach  also  substantially  reduces 
code  complexity.  Within  a  processing  node,  each  of  the  4  vector  units  is  responsible  for  updating  it’s 
quarter  of  the  space.  Communication  between  each  vector  unit’s  8  Mbyte  bank  of  memory  is  mediated 
by  the  SPARC  processor. 
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Figure  2.  Problem  Space  Partitioning 


There  are  two  distinct  phases  of  a  lattice  gas  update  cycle.  The  first  phase  is  the  collision  phase  where 
particles  interact,  this  amounts  to  a  local  permutation  of  the  data  at  a  particular  site.  The  second  phase 
involves  streaming  of  the  bits  to  their  new  locations,  consistent  with  their  velocity  and  the  lattice  on 
which  the  simulation  is  being  performed.  In  most  lattice  gas  models  all  collisions  can  happen 
concurrently  and  all  sites  can  stream  their  data  concurrently,  as  well. 


4.  STRUCTURE  OF  A  CM5  APPLICATION 


There  are  several  different  paradigms  for  implementation  of  a  particular  application  on  the  CM5.  There 
are  languages  such  as  C*  and  CM-FORTRAN  that  insulate  the  implementor  from  explicit  management 
of  processors  and  allow  the  use  of  high  level  concepts  and  structures  in  a  parallel  environment.  C*  and 
CM-FORTRAN  have  a  Single  Instruction  Multiple  Data  (SIMD)  approach  to  parallelism.  Typically  there 
is  a  substantial  performance  loss  when  using  these  languages,  as  the  compilers  introduce  substantial 
overhead  in  their  management  of  the  available  resources. 

There  is  another  approach  to  parallel  computing  known  as  Multiple  Instruction  Multiple  Data  (MIMD). 
This  model  contains  several  processors,  each  executing  it’s  own  stream  of  instructions  that  communicate 
with  other  processors  via  message  passing.  This  type  of  model  is  available  through  the  use  of  the  CM5 
message  passing  library  known  as  CMMD.  CMMD  provides  communication  and  synchronization 
primitives  that  can  be  used  to  produce  a  MIMD  application.  These  primitives  can  be  manipulated  from 
a  standard  high-level  language  such  as  ANSI  C. 
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For  many  types  of  problems  this  explicit  management  of  the  communications  resources  can  be  more 
efficient  than  implementation  in  the  SIMD  high  level  languages.  Lattice  gas  simulation  tests  have  been 
performed  with  the  CMCAM  code  that  show  a  speed  gain  of  a  factor  of  25  over  a  C*  implementation 
[Yepez,  et  al.,  1994].  Since  a  main  goal  of  a  lattice  gas  implementation  on  the  CM5  has  been  to  produce 
an  application  with  extremely  high  performance,  the  CMMD  message  passing  model  has  been  chosen  and 
will  be  discussed  here.  The  discussion  here  is  based  on  Thinking  Machines’  CMMD  documentation 
[Thinking  Machines  Corp. ,  1993a]. 

There  are  two  styles  for  programming  a  CM-5  application  using  the  facilities  provided  by  CMMD.  There 
is  a  hostless  and  a  host/node  style.  The  host/node  style  has  been  used  here.  In  a  host/node  type 
application  there  are  two  separate  communicating  main  programs.  There  is  a  program  that  runs  on  the 
front-end  host  machine  and  a  second  program  that  is  replicated  on  all  the  processing  nodes.  Since 
CMCAM  performs  such  tasks  as  remote  file  transfer  and  X-window  display  the  host/node  paradigm  is 
the  most  appropriate.  The  X-window  code  and  all  the  disk  I/O  can  be  centralized  in  the  program  that 
runs  on  the  front-end  host. 

The  code  is  divided  into  several  types  of  modules.  Modules  with  a  .cp.c  extension  are  ANSI  C  program 
modules  that  run  on  the  front  end  host.  These  modules  are  typically  concerned  with  disk  I/O,  X-Window 
machinations  or  supplying/gathering  simulation  data  to/firom  node  code.  Modules  with  a  .pn.c 
designation  are  modules  that  make  up  the  program  that  runs  on  each  processing  node.  Modules 
designated  .cdp.c  contain  GCC/DPEAC  statements  that  can  be  directly  mapped  to  VU  assembly  language 
instructions  for  the  nodes. 

A  Makefile  handles  the  compilation  and  linking  tasks  for  the  many  types  of  modules.  Briefly,  the  .cp.c 
modules  are  compiled/linked  into  a  host  program  while  the  .pn.c  modules  are  compiled/linked  into  a  node 
program.  The  .cdp  files  are  first  compiled  using  gcc  with  flags  that  reserve  some  registers  for  exclusive 
use  of  the  vector  units.  This  initial  compilation  produces  a  file  that  can  be  run  through  dpas,  the  VU 
assembler.  The  output  of  dpas  is  a  .pn.o  object  module  that  can  be  linked  with  the  node  code.  After 
linking  with  uncountable  libraries,  the  host  and  node  programs  are  forged  into  a  common  executable  that 
can  be  handled  by  the  CM-5  run  time  system.  Figure  3  shows  the  header  file  inheritance  scheme  for  the 
host  and  node  code.  This  scheme  effectively  partitions  the  definitions  and  declarations  on  both  the  host 
and  node  sides  of  the  code,  while  allowing  for  some  definitions  to  be  commonly  shared. 
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Figure  3.  Header  File  Include  Scheme 


5.  STREAMING 


The  most  time  consuming  part  of  the  code  is  the  streaming.  Even  though  each  cell  only  communicates 
with  its  near  neighbors  during  an  LGA  update  cycle  the  communications  part  of  the  code  is  substantial, 
as  individual  bits  must  be  picked  out  of  words  and  reassembled  into  other  words.  In  this  CM-5 
implementation  there  are  two  types  of  boundaries  that  the  particles  must  move  across.  Particles  must 
move  across  VU  boundaries,  since  each  VU  has  its  own  bank  of  memory.  Particles  must  also  move 
between  processing  nodes  in  the  machine. 

The  streaming  completely  defines  the  lattice  structure  of  the  simulation.  The  first  type  of  boundary  is 
between  each  vector  unit  on  a  CM-5  processing  node.  Communication  between  the  VU’s  is  mediated 
by  the  SPARC  processor.  Values  are  read  from  the  registers  on  each  VU,  transferred  to  a  SPARC 
register  and  then  written  to  the  appropriate  VU.  This  transfer  takes  place  at  each  streaming  step  of  the 
calculation  and  is  accomplished  using  the  dpread  and  dpload  CDPEAC  macros. 

At  the  edge  of  each  processing  node  there  is  a  different  boundary  that  must  be  crossed,  this  boundary  is 
between  two  processing  nodes.  Communication  between  nodes  involves  the  data  network  and  is  done 
using  the  CMMD  message  passing  library  calls.  This  type  of  communication  is  substantially  slower  than 
the  on-node  communication.  For  VU  0  and  VU  3  there  is  an  additional  step  of  moving  the  data  from 
the  VU  address  space  into  the  SPARC  memory  address  space.  The  communication  must  ensure  that 
every  site  on  every  VU  has  a  directly  accessible  copy  of  the  site  data  at  each  of  its  neighboring  sites. 
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The  process  of  updating  each  site  after  the  communication  process  goes  as  follows: 

1.  Each  site  loads  pointers  to  its  neighbors  from  pre-computed  addressing  tables.  These  tables  may 
be  computed  in  ordinary  C,  which  is  advantageous  for  changing  from  one  model  to  another. 
Additionally,  potentially  complex  addressing  calculations  are  performed  only  once,  during 
initialization. 

2.  The  pointers  are  dereferenced  using  the  indirect  addressing  capability  of  the  vector  units. 

3.  Each  site  masks  off  the  appropriate  bits  from  it’s  neighbors  and  accumulates  these  bits  in  a 
register. 

4.  After  all  the  bits  from  all  the  directions  have  been  accumulated  the  value  is  written  to  memory. 
Memory  is  double  buffered  so  that  only  old  values  are  used  in  the  composition  of  new  ones. 


Here  is  an  actual  DPEAC  code  fragment  that  performs  these  operations: 


loadv  v  u(du,vlen,RO  addr,8,V6[0] ); 
R0.addr-+=  R  bump; 
l oadv_i ( u, Sv7v6, V8) ; 
loadv- i(u,Sv,V7,V9); 
andv(du, V8, SCALAR (R 126), V2); 

loadv_v_u(du,vlen,R1_addr,8,V6[0J ); 
RO  addr-+=  R_bunp; 

lo5dvJ<u,Sv7v6,V8); 
l  oadO  (  u,  Sv,  V7,  V9)  ; 
andv(du,V8,SCALAR<R126),V4); 

addy(du,V2,V4,V2>; 


/*  load  pointers  to  RO  neighbors  */ 
/*  bump  pointer  table  address  V 
/*  load  RO  neighbor  data  V 
/*  load  RO  neighbor  data  */ 

/*  RO  neighbor  &  SCALARIR126)  */ 

/*  load  pointers  to  R1  neighbors  */ 
/*  bump  pointer  table  address  */ 

/*  load  Rl  neighbor  data  V 
/*  toad  Rl  neighbor  data  V 
/*  Rl  neighbor  &  SCALAR(R124)  •/ 

/*  accumulate  Rl  bits  */ 


storev_v_u(du, vl en, SHEXT_addr , 8, V2 10] ) ;  /*  write  out  accumulated  results  */ 


The  first  load  instruction  loads  2  32-bit  pointers  into  each  element  of  vector  unit  register  V6,  since  it  is 
a  double  unsigned  (du)  format  instruction.  Loading  2  single  precision  quantities  as  one  double  precision 
quantity  doubles  the  effective  memory  bandwidth.  These  two  32-bit  pointers  show  up  as  single  precision 
quantities  in  V6  and  V7.  The  addresses  in  V6  and  V7  are  then  used  as  offsets  to  load  the  actual  site  data 
from  neighbors  in  the  RO  direction  into  V8  and  V9.  There  are  two  indirect  load  instructions  since  we 
only  want  to  load  unsigned  single  precision  (u)  quantities.  Then  the  bits  from  the  neighboring  site  are 
subjected  to  a  logical  and  operation  which  selects  the  bits  that  stream  to  the  present  site.  Double 
precision  masks  have  already  been  loaded  into  registers  R126,R124  for  this  purpose.  This  process  is 
repeated  for  all  the  lattice  directions  and  the  incoming  bits  are  accumulated  in  a  register  using  an  add 
operation.  After  all  the  streaming  has  been  completed  the  final  accumulated  results  are  written  to 
memory  as  a  double  precision  quantity. 


6.  COLLISIONS 


The  collision  phase  can  be  handled  via  look  up  tables  (LUT’s)  for  16  bit  sites.  The  LUT  is  attractive  in 
that  it  can  be  an  extremely  simple  and  fast  update  mechanism.  We  have  distributed  the  LUTs  throughout 
the  machine,  indeed  each  vector  unit  has  it’s  own  copy  of  the  LUT.  Figure  4  shows  the  memory  layout 
on  each  node.  During  the  collision  phase  each  vector  unit  fetches  all  the  sites  in  it’s  partition  of  the 
problem  space  and  runs  them  through  its  copy  of  the  LUT.  Since  each  vector  unit  has  it’s  own 
independent  128  Mbyte/sec  data  path  to  a  bank  of  memory,  this  operation  can  be  performed  extremely 


6 


rapidly.  With  this  high  degree  of  parallelism  the  LUT  operation  consumes  a  small  fraction  of  the  time 
necessary  to  update  the  space.  As  the  number  of  bits  of  site  data  grows  beyond  16  (64K  entries),  the 
LUT’s  begin  to  consume  too  much  memory.  For  models  that  involve  larger  quantities  of  site  data  (i.e. 
\#  bits  $>$  20)  other  methods  involving  LUT  compression/decompression  need  to  be  used  for  the 
collision  phase  [Henon,  1992], 
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Data 
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Figure  4.  Node  Memory  Layout 


Here  is  a  GCC/DPEAC  code  fragment  that  performs  collisions.  This  code  assumes  that  we  have  loaded 
V2  with  sites  that  are  ready  for  collisions. 


mulv(u,V2,SCALAR(R1V1),V6);  /*  multiply  new  state  by  4  to  use  */ 

/*  as  byte  offset  into  lookup  table*/ 

loadv_i(u,clutv,V6,V8);  /*  perform  indirect  load  from  lookup  table  */ 

The  code  fragment  shows  that  only  a  multiply  and  an  indirect  load  are  necessary  to  perform  a  lookup 
table  based  collision. 


7.  AN  EXAMPLE  CALCULATION 


Here  is  an  example  of  how  to  conduct  a  particular  simulation  experiment  using  CMCAM.  This 
calculation  will  utilize  an  8  bit  variant  of  the  FHP  model,  which  contains  a  rest  particle  and  obstacle  bit. 
The  bit  definitions  are  contained  in  fhp_hood.h.  Initial  conditions,  boundary  conditions  and  any  required 
forcing  for  the  flow  need  to  be  specified. 

Let’s  consider  a  specific  flow  experiment  in  some  detail.  The  flow  to  be  examined  is  channel  flow  with 
a  flat  plate  obstacle.  Figure  5  shows  a  diagram  of  this  situation.  The  first  consideration  is  that  we  need 
a  steady  flow  directed  towards  the  right  of  the  diagram.  Viscous  dissipation  will  reduce  the  flow  velocity 
to  zero  unless  we  include  some  type  of  forcing  which  will  keep  the  fluid  moving  down  the  channel. 
Another  important  aspect  of  performing  this  channel  flow  experiment  is  what  to  do  about  the  inflow  and 
outflow  boundary  conditions.  If  a  simple  periodic  geometry  is  used,  disturbances  propagating  along  the 
channel  could  re-enter  it  and  eventually  dominate  the  simulation  behavior. 


7 


Flow  direction  i — [>  i — [> 


Figure  5.  Flat  plate  channel  flow. 


A  periodic  channel  geometry  that  incorporates  a  forcing  strip  located  at  the  leftmost  side  of  the  channel 
meets  the  requirements.  The  forcing  strip  completely  reconstructs  the  particle  distribution  in  the  strip 
at  each  time  step.  This  prevents  disturbances  from  propagating  repeatedly  around  the  channel.  The 
reconstruction  maintains  the  desired  net  velocity  and  density  in  the  strip.  Particles  that  emanate  from  the 
strip  are  free  to  propagate  in  either  direction  around  the  cylindrical  channel. 

Initial  conditions  may  be  constructed  in  two  ways.  Constructing  a  pattern 
file  via  an  external  C  program  is  one  way.  This  is  good  for  running  relatively 
small  simulations.  It  allows  easy  reproduction  of  many  experiments 

with  a  simple  external  program  that  needs  only  to  know  the  required  final  output  format.  The  second 
method  is  to  construct  a  C  routine  for 

CMCAM  that  will  construct  the  initial  conditions  at  runtime.  This  method  is 
good  for  performing  extremely  large  simulations,  where  the  initial  pattern 
data  might  be  too  large  to  conveniently  read  in.  The  usual  place  to  insert 
such  a  routine  is  in  init.pn.c. 

Let's  examine  the  code  for  the  initial  conditions.  This  subroutine  will 

first  fill  the  simulation  space  with  a  fluid  at  a  filling  fraction  of  1/7 

with  a  net  velocity  of  0.4  momentum  units  per  site  in  the  positive  x  direction. 


void  init_plate(int  If,  int  Ij,  float  speed) 

< 

int  i.j; 

/*  initializes  flat  plate  */ 

I*  experiment  at  a  filling  fraction  of  1/7  */ 

/*  all  nodes  execute  this  code  */ 

/*  loop  over  the  entire  space  on  a  node  */ 
for(i  *  0;  i  <  li;  ++i)< 
for(j  a  0;  j  <  Ij;  ++j){ 

/*  put  rest  particles  everywhere  */ 

PIPSCi , j,BlT6,Sv); 


/*  in  10X  of  cases  replace  2  rest  */ 
/*  particles  with  two  oppositely  */ 
/*  directed  particles  V 
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if(zranUO)  <  O.IK 

/*  avoid  giving  particles  a  negative  x  coordinate  */ 
if<j  >  OK 
if(zran1<0)  <  0.5K 
PIPS{1, j,BIT1,Sv); 

PIPS(i, j- 1,81X4, Sv); 

> 

elseC 

PlPS(i, j,BIT0,Sv); 

PIPS(i, j-1,BIT3,Sv); 

> 

) 

> 

/*  insert  particles  with  1  unit  of  */ 

/*  forward  momentun  to  obtain  an  average  */ 
/*  momentun  per  site  of  [speed]  */ 
if(zran1(0)  <  speedK 
if(zran1(0)  <  0.5K 
PIPS(i, j,BIT3,Sv); 

> 

else< 

PIPS(i, j,8IX4,Sv); 

) 

> 

/*  insert  bits  for  plate*/ 
if(self_address  ==  (partition_size  /  8)  K 

/*  this  code  only~executes  on  one  node  */ 
for(j  =  0;  j  <  5;  *+j)< 

for(i  =  li/2  -  li/16;  i  <  li/2  ♦  li/16;  +♦!)< 

PIPS(i , j ,BIT7,Sv); 

> 

1 

> 


/*  add  hard  walls  at  top  and  bottom  of  channel  */ 

for(j  =0;  j  <  Ij;  ++j)< 
i  =  0; 

PIPS(i,j,BIX7,Sv>; 
i  =  li  -  1; 

PIPS( i , j,BIT7,Sv); 

> 

> 


An  important  thing  to  keep  in  mind  is  that  this  code  runs  on  every  processing  node  involved  in  the 
simulation.  The  loop  boundaries  li  and  Ij  denote  the  boundaries  of  each  processing  node’s  piece  of  the 
simulation  space.  The  PIPS  macro  is  used  to  load  the  values  into  the  memory  of  each  of  the  4  vector 
units  on  each  processing  node.  Note  the  self  address  conditional  test  where  the  flat  plate  obstacle  is 
inserted.  Each  processing  node  has  a  unique  variable  called  self  address.  The  code  that  inserts  the  flat 
plate  obstacle  is  only  run  on  the  node  with  a  specific  address.  This  test  insures  that  the  flat  plate  is 
inserted  at  only  one  point  in  the  channel. 

Now  that  we  have  specified  the  initial  conditions  let’s  examine  the  forcing  scheme  that  keeps  the  fluid 
moving  through  the  channel.  As  was  previously  discussed  we  will  use  a  forcing  scheme  that  not  only 
forces  the  fluid,  but  also  acts  to  prevent  disturbances  from  recirculating  around  the  channel.  The  code 
is  similar  to  the  initial  conditions  routine  described  above  and  would  usually  reside  in  forcing.pn.c. 
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void  force  jsl«te< float  speed,  int  li,  int  Ij) 

Int  i, j; 

Int  vat; 

Int  r; 

/*  this  forcing  reconstructs  */ 

/*  the  distribution  at  the  forcing  strip  */ 

/*  each  time  step  */ 

/*  assunes  1/7  filling  fraction  */ 

/*  forcing  strip  in  node  0  */ 

/*  forces  flow  in  positive  x  */ 

i f ( se l f _addr ess  **  OK 
for(i~*  0;  i  <  li;  ♦*i)< 

/*  first  lay  down  isotropic  background  */ 

/*  choose  a  random  bit  to  set  */ 
r  ■  (intKzranl(O)  *  7.0); 
val  «  1«r; 

/*  put  resulting  value  into  vector  memory  */ 

PIPS(i,2,val,Sv); 

> 

for(i  a  0;  i  <  li;  ♦♦»)< 

if(zranKO)  <  speed  K  /*  put  in  +x  speed  particles  */ 

/*  to  get  appropriate  speed  */ 

if<zran1(0)  <  0.5  X 
PlPS(i,2,BIT3,Sv); 

> 

else< 

PIPS(i,2,BIT4,Sv); 

> 

) 

> 

> 

> 


There  is  a  self  address  test  that  makes  sure  the  forcing  is  only  done  on  one  node.  The  rest  of  the  code 
merely  reconstructs  a  strip  of  fluid  with  a  given  net  velocity.  This  routine  is  called  after  each  time  step 
to  maintain  a  steady  flow. 


To  compute  the  flow  velocity  at  each  point  we  use  pre-computed  lookup  tables  that  map  the  individual 
site  states  to  floating  point  numbers  that  specify  density,  \  momentum,  y  momentum  and  so  on.  These 
tables  are  computed  in  tabulate\_states.pn.c. 


Nearly  all  the  simulation  ingredients  are  in  place,  we  have  completely  specified  boundary  conditions  and 
external  forcing.  To  visualize  the  flow  we  need  to  extract  the  simulation  data  from  each  vector  unit’s 
memory,  compute  the  flow  velocity  at  each  point  and  transmit  an  image  to  the  front  end  host,  where  it 
can  be  displayed  or  written  to  disk  for  later  processing.  Figure  6  shows  some  results  from  running  the 
experiment  on  a  1  K  x  2  K  lattice.  The  coloring  indicates  only  the  direction  of  the  flow. 
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Figure  6.  A  Von  Karman  Vortex  Street  behind  a  flat  plate. 


8.  PERFORMANCE 


After  implementing  a  simulator  with  the  above  considerations  in  mind  we  find  that  we  can  achieve  update 
rates  on  the  order  of  550  Msites/sec  on  a  256  node  partition  of  the  CM-5  for  an  8-bit  FHP  gas  model. 
This  timing  was  done  using  a  2048  x  32768  lattice  with  a  model  that  packed  two  8-bit  sites  into  a  32  bit 
word.  We  find  that  the  longer  the  system  is  across  each  node  the  greater  the  performance  realized.  This 
is  due  to  the  fact  that  long  system  sizes  across  each  node  increase  the  fraction  of  sites  in  the  interior  of 
each  vector  unit  that  do  not  need  to  communicate  with  sites  on  adjacent  vector  units  or  processing  nodes. 

Extending  the  CMCAM  simulator  to  use  more  complex  models  can  result  in  substantial  performance 
penalties.  For  example  the  2-speed  thermohydrodynamic  model  of  Chen,  etal.  [1991]  requires  that  some 
bits  are  streamed  twice  as  fast  as  other  bits.  The  most  straightforward  way  to  implement  this  using 
CMCAM  is  to  perform  two  streaming  cycles  before  a  collision  cycle,  incurring  a  factor  of  2  performance 
loss. 
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APPENDIX  A.  CODING  CONVENTIONS 


11,1 j  Bounds  of  a  pn's  portion  of  simulation  space 

Sv  time  t  lattice  memory 

SNEXTv  time  t  +  1  lattice  memory 

PIPS(i, j,val,addr)  [Put  Into  Parallel  Space] 

Write  (val)  into  vu  memory  at  lattice  site  i,j 
This  macro  automatically  selects  the  appropriate  VU 
to  write  to. 

GPPS(i, j,addr)  [Get  From  Parallel  Space] 

Read  from  vu  memory  at  lattice  site  i,j  into  addr. 
This  macro  automatically  selects  the  appropriate  VU 
to  read  from. 


APPENDIX  B.  ARGUMENTS 


Recognized  Command  line  arguments  to  CMCAM: 

Required: 

-nsteps  (int)  ->  number  of  steps  to  run 

-report_freq  [int]  ->  display  and  statistics  gathering  interval 
-pattern  file  [fname]  ->  filename  of  initial  conditions  pattern 
-rlutfiTe  [fname]  ->  filename  of  the  right  handed  lut 

-llut_file  [fname]  ->  filename  of  the  left  handed  lut 

-sys_x  [int]  ->  system  x  dimension 

-sys_y  [int]  ->  system  y  dimension 

Optional : 

-forcing  [float]  ->  number  of  momentum  units  per  time  step  to  add  in 

each  direction 

-xdisplay  ->  if  present,  an  X  window  is  opened  and  displays  the  state 
every  report_freq  timesteps 

-cam_pattern  ->  the  pattern  file  is  a  CAM  format  pattern  file 

-colordisp  ->  color  display 

-display_table_f ile  ->  file  that  maps  automaton  states  to 

8-bit  color  indices 

-palette_f ile  ->  file  that  maps  8-bit  color  indices  to  RGB  colors 

-send_frames  [hostname: directory)  ->  enables  automatic  rep  of  generated 

frames  to  specified  host 


» 


« 
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APPENDIX  C.  CAM-8  COMPATIBILITY 


CMCAM  is  compatible  with  the  MIT  Information  Mechanics  Group’s  CAM-8  machine  in  the  following 
ways. 

a 

1.  The  format  of  the  collision  rule  lookup  table  is  identical.  LUT’s  produced  for  the  CAM  are 

usable  directly  by  CMCAM.  , 

2.  CAM  display  tables  and  color  palettes  are  directly  usable  by  CMCAM. 

3.  CAM  initial  pattern  data  is  consumable  by  CMCAM. 

Streaming  information  is  encoded  in  a  different  manner  on  each  machine  and  there  is  no  provision  for 
directly  transporting  code  from  one  machine  to  the  other. 
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APPENDIX  D.  OBTAINING  AND  RUNNING  THE  CODE 


CMCAM  source  code  may  be  currently  obtained  from  Guy  Seeley,  email  address: 
seeley@wind.plh.af.mil,  phone  (617)377-2475. 


1.  Uncompress  and  untar  the  file  containing  the  source  and  appropriate  supporting  files. 

2.  Type  make.  This  should  compile  all  the  modules  and  produce  an  executable  known  as  lutcm. 

3.  Make  sure  that  you  are  operating  on  a  color  X-Window  terminal  with  your  DISPLAY 
environment  variable  appropriately  set  and  access  to  your  own  X  display  enabled. 

4.  Type  sample jmn.  This  will  run  a  sample  calculation  on  a  32  node  partition  of  a  CM-5.  This 
calculation  will  display  an  X-window  view  of  the  simulation  every  250  time  steps. 
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